The demo shows how you can generate the synthetic population of China using SyntheticPopulation.jl package.
include("../src/SyntheticPopulation.jl")
using PyCall
using Colors
folium = pyimport("folium")
PyObject <module 'folium' from '/Users/marcinzurek/.julia/conda/3/x86_64/lib/python3.10/site-packages/folium/__init__.py'>
For the purpose of this demo, we are using data from:
The data is input manually in a form of Julia DataFrame. We are using the following tables:
Because of large population of Beijing, the data is scales by 0.1% to decrease computational costs.
Each of dataframes provides information about the distribution of population. We can only access marginal distributions of attributes (e.g. population by age and sex).
#each individual and each household represent 1000 individuals or households
SCALE = 0.001
0.001
#all values are based on China census data
popoulation_size = 21890000
#individuals
marginal_ind_age_sex = DataFrame(
sex = repeat(['M', 'F'], 18),
age = repeat(2:5:87, inner = 2),
population = SCALE .* 10000 .* [52.6, 49.0, 48.5, 44.8, 33.6, 30.6, 34.6, 28.8, 71.6, 63.4, 99.6, 90.9, 130.9, 119.4, 110.8, 103.5, 83.8, 76.4, 84.2, 77.7, 84.2, 77.8, 82.8, 79.9, 67.7, 71.0, 56.9, 62.6, 31.5, 35.3, 18.5, 23.0, 15.2, 19.7, 12.5, 16.0]
)
marginal_ind_sex_maritialstatus = DataFrame(
sex = repeat(['M', 'F'], 4),
maritialstatus = repeat(["Not_married", "Married", "Divorced", "Widowed"], inner = 2),
population = SCALE .* [1679, 1611, 5859, 5774, 140, 206, 128, 426] ./ 0.00082
)
marginal_ind_income = DataFrame(
income = [25394, 44855, 63969, 88026, 145915],
population = repeat([popoulation_size * SCALE / 5], 5)
)
#households
marginal_hh_size = DataFrame(
hh_size = [1,2,3,4,5],
population = Int.(round.(SCALE * 8230000 .* [0.299, 0.331, 0.217, 0.09, 0.063]))
)
| Row | hh_size | population |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 2461 |
| 2 | 2 | 2724 |
| 3 | 3 | 1786 |
| 4 | 4 | 741 |
| 5 | 5 | 518 |
Next, we generate a Julia DataFrame which presents population breakdown per each district. This consists of 2 steps:
#areas
URL = "https://osm-boundaries.com/Download/Submit?apiKey=87100809b4085adb58139419c141e5a1&db=osm20230102&osmIds=-2988894,-2988933,-2988895,-288600,-2988896,-2988946,-5505984,-2988897,-2988898,-2988899,-2988900,-5505985,-2988901,-2988902,-568660,-2988903&format=GeoJSON&srid=4326"
areas = generate_areas_dataframe(URL)
#aggregated_areas - population referenced from https://nj.tjj.beijing.gov.cn/nj/main/2021-tjnj/zk/indexeh.htm
aggregated_areas = copy(areas)
aggregated_areas.:population = SCALE .* 10000 .* [56.8, 313.2, 201.9, 345.1, 34.6, 184.0, 132.4, 45.7, 52.8, 39.3, 44.1, 131.3, 199.4, 226.9, 110.6, 70.9]
aggregated_areas = aggregated_areas[:, [:id, :population]]
areas
Downloading file... File downloaded. Unzipping file... File saved at /Users/marcinzurek/Desktop/Studia/Research/SyntheticPopulation/notebooks/file.geojson
| Row | id | geometry | name |
|---|---|---|---|
| Int64 | Geometry | String | |
| 1 | 1 | Polygon | Shijingshan District |
| 2 | 2 | Polygon | Haidian District |
| 3 | 3 | Polygon | Fengtai District |
| 4 | 4 | MultiPolygon | Chaoyang District |
| 5 | 5 | Polygon | Yanqing District |
| 6 | 6 | Polygon | Tongzhou District |
| 7 | 7 | Polygon | Shunyi District |
| 8 | 8 | Polygon | Pinggu District |
| 9 | 9 | Polygon | Miyun District |
| 10 | 10 | Polygon | Mentougou District |
| 11 | 11 | Polygon | Huairou District |
| 12 | 12 | Polygon | Fangshan District |
| 13 | 13 | Polygon | Daxing District |
| 14 | 14 | Polygon | Changping District |
| 15 | 15 | Polygon | Xicheng District |
| 16 | 16 | Polygon | Dongcheng District |
As the next step, we merge marginal distributions of attributes of individuals to obtain join distributions for attribute combinations. This is done using a recursive algorithm that leverages Iterative Proportional Fitting Method proposed by Guo, Bhat, 2007 [1]. In addition, our algorithm is improved because it can be configured with a JSON file to provide more flexibility. This approach is insipired by Ponge, Enbergs et al. 2021. The configuration documentation is described elsewhere.
[1] Guo, J. Y., & Bhat, C. R. (2007). Population synthesis for microsimulating travel behavior. Transportation Research Record, 2014(1), 92-101.
[2] Ponge, J., Enbergs, M., Schüngel, M., Hellingrath, B., Karch, A., & Ludwig, S. (2021, December). Generating synthetic populations based on german census data. In 2021 Winter Simulation Conference (WSC) (pp. 1-12). IEEE.
individuals, aggregated_individuals = generate_joint_distributions(marginal_ind_sex_maritialstatus, marginal_ind_age_sex, marginal_ind_income, config_file = "config_file.json")
[ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9520, 9502] [ Info: Converged in 1 iterations. [ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9777, 9166] [ Info: Converged in 1 iterations. [ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [18668, 21890] [ Info: Converged in 1 iterations.
(568×5 DataFrame Row │ id maritialstatus sex age income │ Int64 String? Char? Int64? Int64? ─────┼─────────────────────────────────────────────── 1 │ 1 Not_married M 22 25394 2 │ 2 Married M 22 25394 3 │ 3 Divorced M 22 25394 4 │ 4 Widowed M 22 25394 5 │ 5 Not_married F 22 25394 6 │ 6 Married F 22 25394 7 │ 7 Divorced F 22 25394 8 │ 8 Widowed F 22 25394 9 │ 9 Not_married M 27 25394 10 │ 10 Married M 27 25394 11 │ 11 Divorced M 27 25394 ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ 559 │ 559 Divorced F 87 145915 560 │ 560 Widowed F 87 145915 561 │ 561 missing F 2 missing 562 │ 562 missing F 7 missing 563 │ 563 missing F 12 missing 564 │ 564 missing F 17 missing 565 │ 565 missing M 2 missing 566 │ 566 missing M 7 missing 567 │ 567 missing M 12 missing 568 │ 568 missing M 17 missing 547 rows omitted, 568×2 DataFrame Row │ id population │ Int64 Int64 ─────┼─────────────────── 1 │ 1 31 2 │ 2 107 3 │ 3 3 4 │ 4 2 5 │ 5 43 6 │ 6 150 7 │ 7 4 8 │ 8 3 9 │ 9 56 10 │ 10 196 11 │ 11 5 ⋮ │ ⋮ ⋮ 559 │ 559 1 560 │ 560 2 561 │ 561 490 562 │ 562 448 563 │ 563 306 564 │ 564 288 565 │ 565 526 566 │ 566 485 567 │ 567 336 568 │ 568 346 547 rows omitted)
We do the same for a dataframe with households to generate households pool.
households, aggregated_households = generate_joint_distributions(marginal_hh_size)
(5×2 DataFrame Row │ id hh_size │ Int64 Int64 ─────┼──────────────── 1 │ 1 1 2 │ 2 2 3 │ 3 3 4 │ 4 4 5 │ 5 5, 5×2 DataFrame Row │ id population │ Int64 Int64 ─────┼─────────────────── 1 │ 1 2461 2 │ 2 2724 3 │ 3 1786 4 │ 4 741 5 │ 5 518)
Once the individual pool and household pool is generated, we assign individuals to households. The individuals are assigned to households based on the following rules:
disaggregated_households, unassigned = assign_individuals_to_households(individuals, aggregated_individuals, households, aggregated_households, return_unassigned = true)
Total_number of individuals: 21885 Total_number of households: 8230 Allocation started...
Progress: 100%|█████████████████████████████████████████| Time: 0:00:06
Allocated 86.0% individuals. Allocated 100.0% households.
(8230×9 DataFrame Row │ id hh_attr_id hh_size individuals_allocated head_id partner_id ⋯ │ Int64 Int64 Int64 Bool Int64 Int64 ⋯ ──────┼───────────────────────────────────────────────────────────────────────── 1 │ 1 1 1 true 489 0 ⋯ 2 │ 2 1 1 true 493 0 3 │ 3 1 1 true 233 0 4 │ 4 1 1 true 152 0 5 │ 5 1 1 true 287 0 ⋯ 6 │ 6 1 1 true 412 0 7 │ 7 1 1 true 457 0 8 │ 8 1 1 true 532 0 9 │ 9 1 1 true 245 0 ⋯ 10 │ 10 1 1 true 237 0 11 │ 11 1 1 true 75 0 ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ 8221 │ 8221 5 5 true 394 222 8222 │ 8222 5 5 true 290 246 ⋯ 8223 │ 8223 5 5 true 474 78 8224 │ 8224 5 5 true 62 410 8225 │ 8225 5 5 true 10 326 8226 │ 8226 5 5 true 70 282 ⋯ 8227 │ 8227 5 5 true 346 254 8228 │ 8228 5 5 true 242 286 8229 │ 8229 5 5 true 286 306 8230 │ 8230 5 5 true 234 398 ⋯ 3 columns and 8209 rows omitted, Dict{String, DataFrame}("unassigned_individuals" => 471×2 DataFrame Row │ id population │ Int64 Int64 ─────┼─────────────────── 1 │ 1 3 2 │ 5 1 3 │ 6 3 4 │ 9 21 5 │ 10 8 6 │ 11 2 7 │ 12 3 8 │ 13 23 9 │ 14 3 10 │ 15 2 11 │ 16 1 ⋮ │ ⋮ ⋮ 462 │ 557 2 463 │ 558 1 464 │ 561 16 465 │ 562 12 466 │ 563 8 467 │ 564 6 468 │ 565 12 469 │ 566 8 470 │ 567 6 471 │ 568 12 450 rows omitted, "unassigned_households" => 0×2 DataFrame Row │ id population │ Int64 Int64 ─────┴───────────────────))
Once each household is filled with individuals of certain type, we assign coordinates for each household. This is done in 2 steps:
disaggregated_households, unassigned = assign_areas_to_households!(disaggregated_households, areas, aggregated_areas, return_unassigned = true)
disaggregated_households
=================== Assigning coordinates to households... ===================
Progress: 100%|█████████████████████████████████████████| Time: 0:00:29
| Row | id | hh_attr_id | individuals_allocated | head_id | partner_id | child1_id | child2_id | child3_id | lat | lon | area_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Bool | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 | Int64 | |
| 1 | 1 | 1 | true | 489 | 0 | 0 | 0 | 0 | 39.8338 | 116.21 | 3 |
| 2 | 2 | 1 | true | 493 | 0 | 0 | 0 | 0 | 39.8864 | 116.736 | 6 |
| 3 | 3 | 1 | true | 233 | 0 | 0 | 0 | 0 | 39.8994 | 116.521 | 4 |
| 4 | 4 | 1 | true | 152 | 0 | 0 | 0 | 0 | 39.9508 | 116.342 | 2 |
| 5 | 5 | 1 | true | 287 | 0 | 0 | 0 | 0 | 39.6194 | 115.582 | 12 |
| 6 | 6 | 1 | true | 412 | 0 | 0 | 0 | 0 | 39.7954 | 116.423 | 3 |
| 7 | 7 | 1 | true | 457 | 0 | 0 | 0 | 0 | 39.5492 | 116.361 | 13 |
| 8 | 8 | 1 | true | 532 | 0 | 0 | 0 | 0 | 39.6505 | 116.843 | 6 |
| 9 | 9 | 1 | true | 245 | 0 | 0 | 0 | 0 | 39.8015 | 116.353 | 3 |
| 10 | 10 | 1 | true | 237 | 0 | 0 | 0 | 0 | 39.7172 | 116.713 | 6 |
| 11 | 11 | 1 | true | 75 | 0 | 0 | 0 | 0 | 39.947 | 116.279 | 2 |
| 12 | 12 | 1 | true | 401 | 0 | 0 | 0 | 0 | 40.2152 | 116.699 | 7 |
| 13 | 13 | 1 | true | 9 | 0 | 0 | 0 | 0 | 40.266 | 116.334 | 14 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 8219 | 8219 | 5 | true | 234 | 310 | 1 | 343 | 454 | 39.8441 | 116.525 | 4 |
| 8220 | 8220 | 5 | true | 482 | 350 | 568 | 342 | 6 | 39.9024 | 116.41 | 16 |
| 8221 | 8221 | 5 | true | 394 | 222 | 567 | 454 | 561 | 40.0025 | 116.262 | 2 |
| 8222 | 8222 | 5 | true | 290 | 246 | 561 | 568 | 230 | 39.8841 | 116.334 | 15 |
| 8223 | 8223 | 5 | true | 474 | 78 | 566 | 563 | 564 | 40.0491 | 116.312 | 2 |
| 8224 | 8224 | 5 | true | 62 | 410 | 338 | 454 | 562 | 39.8415 | 116.391 | 3 |
| 8225 | 8225 | 5 | true | 10 | 326 | 230 | 118 | 568 | 39.619 | 116.43 | 13 |
| 8226 | 8226 | 5 | true | 70 | 282 | 561 | 567 | 562 | 39.6206 | 116.243 | 13 |
| 8227 | 8227 | 5 | true | 346 | 254 | 564 | 568 | 562 | 40.0993 | 116.132 | 2 |
| 8228 | 8228 | 5 | true | 242 | 286 | 567 | 565 | 561 | 40.1734 | 116.287 | 14 |
| 8229 | 8229 | 5 | true | 286 | 306 | 563 | 563 | 338 | 40.0472 | 116.236 | 2 |
| 8230 | 8230 | 5 | true | 234 | 398 | 565 | 566 | 567 | 40.0092 | 116.385 | 4 |
In the dataframe disaggregated_households each row represents one household. For each household we assigned some household members, each of them being characterized by a set of attributes. Also, each household has coordinates that show the location of the household.
Let's visualise it:
m = folium.Map(location = [disaggregated_households.lat[1], disaggregated_households.lon[1]], zoom_start=11)
i = 1
for area in unique(disaggregated_households.area_id)
colrs = distinguishable_colors(length(unique(disaggregated_households.area_id)), [RGB(1,0.6,0.5)])
hh_color = "#$(hex(colrs[i]))"
i += 1
area = filter(row -> row.area_id == area, disaggregated_households)
for i in 1:nrow(area)
folium.Circle(
location = (area.lat[i], area.lon[i]),
radius = 100,
color = hh_color,
fill = false,
fill_color = hh_color
).add_to(m)
end
end
m